HotSnap: A Hot Distributed Snapshot System For Virtual Machine Cluster
Abstract
Managing a virtual machine cluster (VMC) is challenging owing to reliability requirements such as non-stop service and failure tolerance. Distributed snapshot of a VMC is one promising approach to supporting system reliability: it allows data center administrators to recover the system from failure and resume execution from an intermediate state rather than the initial state. However, due to the heavyweight nature of virtual machine (VM) technology, applications running in the VMC suffer from long downtime and performance degradation during a snapshot. In addition, the discrepancy in snapshot completion times among VMs causes the TCP backoff problem, resulting in network interruption between two communicating VMs. This paper proposes HotSnap, a VMC snapshot approach designed to take hot distributed snapshots with milliseconds of system downtime and TCP backoff duration. At the core of HotSnap are the transient snapshot, which saves the minimum instantaneous state in a short time, and the full snapshot, which saves the entire VM state during normal operation. We then design a snapshot protocol to coordinate the individual VM snapshots into a globally consistent state of the VMC. We have implemented HotSnap on QEMU/KVM and conducted several experiments to show its effectiveness and efficiency. Compared to the live-migration-based distributed snapshot technique, which incurs seconds of system downtime and network interruption, HotSnap incurs only tens of milliseconds.
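The split between a short transient snapshot and a background full snapshot can be illustrated with a toy model. The sketch below assumes a copy-on-write scheme, which the abstract itself does not spell out; the class `ToyVM` and all of its methods are hypothetical names for illustration and are not taken from HotSnap or QEMU/KVM.

```python
class ToyVM:
    """Toy model of one VM whose memory is a list of pages (assumption)."""

    def __init__(self, pages):
        self.pages = list(pages)   # live guest memory
        self.snapshot = {}         # page index -> content at snapshot time
        self.pending = set()       # pages not yet saved to the snapshot

    def transient_snapshot(self):
        """Short pause: only record which pages still need saving."""
        self.snapshot = {}
        self.pending = set(range(len(self.pages)))

    def guest_write(self, index, value):
        """Guest write during normal operation.

        If the page has not been saved yet, save its old content first
        (copy-on-write), so the snapshot still reflects the pause instant.
        """
        if index in self.pending:
            self.snapshot[index] = self.pages[index]
            self.pending.discard(index)
        self.pages[index] = value

    def full_snapshot_step(self):
        """Background copy of one unsaved page; returns False when done."""
        if not self.pending:
            return False
        index = self.pending.pop()
        self.snapshot[index] = self.pages[index]
        return True

    def snapshot_image(self):
        """Snapshot content in page order; valid once no page is pending."""
        return [self.snapshot[i] for i in range(len(self.pages))]


vm = ToyVM(["a", "b", "c"])
vm.transient_snapshot()          # brief pause
vm.guest_write(1, "B")           # guest keeps running and dirties a page
while vm.full_snapshot_step():   # background copy finishes the snapshot
    pass
```

In this model the guest is paused only for `transient_snapshot`, while the bulk of the copying happens in `full_snapshot_step` alongside guest writes. Coordinating several such VMs into one consistent cluster state is the role of the snapshot protocol and is not modeled here.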
Similar resources
HotRestore: A Fast Restore System for Virtual Machine Cluster
A common way for virtual machine cluster (VMC) to tolerate failures is to create distributed snapshot and then restore from the snapshot upon failure. However, restoring the whole VMC suffers from long restore latency due to large size snapshot files. Besides, different latencies would make the virtual machines not start at the same time. The prior started virtual machine (VM) thus cannot commu...
USENIX Association, November 9–14, 2014, Seattle, WA: Proceedings of the 28th Large Installation System Administration Conference (LISA 14)
A common way for virtual machine cluster (VMC) to tolerate failures is to create distributed snapshot and then restore from the snapshot upon failure. However, restoring the whole VMC suffers from long restore latency due to large snapshot files. Besides, different latencies would lead to discrepancies in start time among the virtual machines. The prior started virtual machine (VM) thus cannot ...
Low-Profile Source-side Deduplication for Virtual Machine Backup
This paper presents a source-side backup scheme with low-resource usage through collaborative deduplication and approximated lazy deletion when frequent virtual machine snapshot backup is required in a large-scale cloud cluster. The key ideas are to orchestrate multiround duplicate detection batches among machines in a partitioned asynchronous manner and remove most unreferenced content chunks ...
Designing a Distributed JVM on a Cluster
dJVM provides a distributed Java Virtual Machine (JVM) on a cluster. It hides the distributed nature of the underlying machine from a Java application by presenting a single system image (SSI) to that application. dJVM is based on the Jikes RVM [Alpern et al., 1999] (a JVM written entirely in Java) and is the first distributed implementation of the Jikes RVM. This provides a framework for explor...